33.2 Deploying the LiteLLM Gateway


33.2.1 Introduction to LiteLLM

LiteLLM is an open-source LLM gateway that supports 100+ LLM providers, including Anthropic, OpenAI, Cohere, and others. It exposes a single unified API, which simplifies using and managing multiple providers.

Core Features of LiteLLM

  1. Multi-provider support: works with 100+ LLM providers
  2. Unified API: one consistent interface that simplifies integration
  3. Smart caching: built-in caching to reduce cost and latency
  4. Rate limiting: configurable limits to control usage
  5. Cost tracking: detailed usage and cost analytics
  6. Load balancing: distributes requests across multiple API keys
  7. Retry on failure: automatically retries failed requests
  8. Streaming: supports streamed output
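Because the proxy speaks an OpenAI-compatible dialect, the "unified API" point above reduces to a plain HTTP POST from the client's perspective. The sketch below only assembles such a request; the base URL, key, and model name are placeholders for illustration, not values mandated by LiteLLM.

```python
# Minimal sketch of a chat request against a LiteLLM proxy.
# The URL, API key, and model name are hypothetical placeholders.

def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> dict:
    """Assemble the parts of an OpenAI-compatible chat completion call."""
    return {
        "url": f"{base_url}/v1/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
    }

req = build_chat_request("http://localhost:4000", "sk-team-a-key-123",
                         "claude-sonnet-4", "Hello")
# To actually send it: requests.post(req["url"], headers=req["headers"], json=req["json"])
print(req["url"])  # http://localhost:4000/v1/chat/completions
```

Switching providers then means changing only the `model` string, while the request shape stays the same.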

LiteLLM Architecture

```
┌─────────────────────────────────────────┐
│           Claude Code client            │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│              LiteLLM Proxy              │
│   ┌──────────────────────────────┐      │
│   │  API layer                   │      │
│   │  (Anthropic, OpenAI, ...)    │      │
│   └──────────────────────────────┘      │
│   ┌──────────────────────────────┐      │
│   │  Cache layer                 │      │
│   │  (Redis, Memcached)          │      │
│   └──────────────────────────────┘      │
│   ┌──────────────────────────────┐      │
│   │  Monitoring layer            │      │
│   │  (Prometheus, Grafana)       │      │
│   └──────────────────────────────┘      │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│             LLM providers               │
│    (Anthropic, OpenAI, Cohere, ...)     │
└─────────────────────────────────────────┘
```

33.2.2 Installation and Configuration

1. Install LiteLLM

Install with Docker (recommended)

```bash
# Pull the LiteLLM image
docker pull litellm/litellm:latest

# Create a config directory
mkdir -p ~/litellm/config
cd ~/litellm

# Create the config file
cat > config.yaml << EOF
model_list:
  - model_name: claude-sonnet-4
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-opus-4
    litellm_params:
      model: claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku-4
    litellm_params:
      model: claude-haiku-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  drop_params: true
  set_verbose: true

general_settings:
  master_key: sk-litellm-master-key-123456
  database_url: postgresql://user:password@localhost:5432/litellm

security_settings:
  valid_api_keys:
    - sk-team-a-key-123
    - sk-team-b-key-456
EOF

# Start LiteLLM
docker run -d \
  --name litellm \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  litellm/litellm:latest
```

Install with Python

```bash
# Install LiteLLM with the proxy extras
pip install 'litellm[proxy]'

# Initialize a configuration
litellm init

# Edit the config file
nano litellm_config.yaml

# Start the proxy server
litellm proxy --config litellm_config.yaml --port 4000
```

2. Configuration File Explained

```yaml
# litellm_config.yaml

# Model list
model_list:
  # Anthropic Claude models
  - model_name: claude-sonnet-4
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      api_base: https://api.anthropic.com
      max_tokens: 4096
      temperature: 0.7
  - model_name: claude-opus-4
    litellm_params:
      model: claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 4096
  - model_name: claude-haiku-4
    litellm_params:
      model: claude-haiku-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 4096

  # Amazon Bedrock models
  - model_name: bedrock-claude-sonnet
    litellm_params:
      model: anthropic.claude-sonnet-4-5-20250929-v1:0
      api_base: https://bedrock-runtime.us-east-1.amazonaws.com
      api_key: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Google Vertex AI models
  - model_name: vertex-claude-sonnet
    litellm_params:
      model: claude-sonnet-4-5@20250929
      api_base: https://us-central1-aiplatform.googleapis.com
      api_key: os.environ/GOOGLE_APPLICATION_CREDENTIALS
      vertex_project: os.environ/VERTEX_PROJECT_ID
      vertex_location: us-central1

# LiteLLM settings
litellm_settings:
  drop_params: true   # drop unsupported parameters
  set_verbose: true   # verbose logging
  json_logs: true     # JSON-formatted logs
  success_callback: http://localhost:5000/callback  # success callback
  failure_callback: http://localhost:5000/failure   # failure callback

# General settings
general_settings:
  master_key: sk-litellm-master-key-123456                         # master key
  database_url: postgresql://user:password@localhost:5432/litellm  # database URL
  cache: redis://localhost:6379                                    # Redis cache
  cache_seconds: 3600                                              # cache TTL (seconds)

# Security settings
security_settings:
  valid_api_keys:           # valid API keys
    - sk-team-a-key-123
    - sk-team-b-key-456
    - sk-team-c-key-789
  max_budget: 1000.0        # maximum budget (USD)
  budget_duration: monthly  # budget period
  rpm_limit: 100            # requests per minute
  tpm_limit: 10000          # tokens per minute

# Load-balancing settings
load_balancing_settings:
  routing_strategy: usage-based  # usage-based, round-robin, least-latency
  health_check: true             # enable health checks
  health_check_interval: 60      # health-check interval (seconds)

# Monitoring settings
monitoring_settings:
  enable_prometheus: true   # enable Prometheus
  prometheus_port: 9090     # Prometheus port
  enable_slack_alerts: true # enable Slack alerts
  slack_webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
  alert_thresholds:
    error_rate: 0.05        # error-rate threshold
    latency_p99: 5000       # P99 latency threshold (ms)
```

33.2.3 Advanced Configuration

1. Cache Configuration

```yaml
# Cache settings
cache_settings:
  type: redis                     # cache type: redis, memory, none
  redis_url: redis://localhost:6379/0
  cache_ttl: 3600                 # cache time-to-live (seconds)
  cache_key_prefix: litellm       # cache key prefix
  enable_cache_for_stream: false  # cache streamed responses?
  cache_control_headers: true     # honor cache-control headers
```
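The idea behind gateway-side caching is that identical requests hash to the same key, so repeats are served without calling the upstream provider. The toy in-memory version below illustrates that mechanic only; it is not LiteLLM's actual implementation, and the `litellm` key prefix is just the configured default from above.

```python
import hashlib
import json

# Toy in-memory response cache, sketching the technique a gateway uses.
# NOT LiteLLM's implementation -- just the general idea.
_cache: dict = {}

def cache_key(model: str, messages: list, prefix: str = "litellm") -> str:
    """Derive a deterministic cache key from the request body."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return f"{prefix}:{hashlib.sha256(payload.encode()).hexdigest()}"

def cached_completion(model, messages, call_llm):
    """Return (response, was_cache_hit); call upstream only on a miss."""
    key = cache_key(model, messages)
    if key in _cache:
        return _cache[key], True          # cache hit: no upstream call
    response = call_llm(model, messages)  # cache miss: call the provider
    _cache[key] = response
    return response, False

msgs = [{"role": "user", "content": "Hello"}]
_, hit1 = cached_completion("claude-sonnet-4", msgs, lambda m, ms: "Hi!")
_, hit2 = cached_completion("claude-sonnet-4", msgs, lambda m, ms: "Hi!")
print(hit1, hit2)  # False True
```

A real deployment would use Redis (as configured above) instead of a process-local dict, and expire entries after `cache_ttl` seconds.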

2. Rate-Limit Configuration

```yaml
# Rate-limit settings
rate_limit_settings:
  enabled: true
  strategy: sliding_window  # sliding_window, token_bucket, fixed_window
  limits:
    - api_key: sk-team-a-key-123
      rpm: 100    # requests per minute
      tpm: 10000  # tokens per minute
      rpd: 10000  # requests per day
    - api_key: sk-team-b-key-456
      rpm: 50
      tpm: 5000
      rpd: 5000
  default_limits:
    rpm: 10
    tpm: 1000
    rpd: 100
  burst_size: 20  # burst size
```

3. Budget Control Configuration

```yaml
# Budget settings
budget_settings:
  enabled: true
  currency: USD
  budgets:
    - name: team-a-budget
      api_keys:
        - sk-team-a-key-123
      limit: 1000.0
      period: monthly
      alert_threshold: 0.8  # alert at 80% of the budget
      hard_limit: true      # block requests once the limit is reached
    - name: team-b-budget
      api_keys:
        - sk-team-b-key-456
      limit: 500.0
      period: monthly
      alert_threshold: 0.9
      hard_limit: false
  cost_tracking:
    enabled: true
    update_interval: 60  # update interval (seconds)
    storage: database    # database, file
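When a key exceeds its rpm/tpm limits, a rate-limiting gateway typically answers with HTTP 429, so clients should back off before retrying. A minimal capped exponential-backoff schedule looks like this; the base and cap values are illustrative choices, not settings from the config above.

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Exponential backoff delays in seconds: base * 2^i, capped at `cap`."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

# A client loop would sleep backoff_delays(n)[i] seconds after the i-th
# 429 response before retrying the request.
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Adding random jitter to each delay is a common refinement that keeps many throttled clients from retrying in lockstep.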

4. Monitoring and Alerting Configuration

```yaml
# Monitoring settings
monitoring_settings:
  prometheus:
    enabled: true
    port: 9090
    metrics:
      - request_count
      - request_duration
      - error_count
      - cache_hit_rate
      - token_usage
      - cost
  grafana:
    enabled: true
    dashboard_url: http://localhost:3000/d/litellm
  alerts:
    slack:
      enabled: true
      webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
      channels:
        - litellm-alerts
        - devops-notifications
      alert_rules:
        - name: high_error_rate
          condition: error_rate > 0.05
          duration: 5m
          severity: warning
        - name: high_latency
          condition: p99_latency > 5000
          duration: 2m
          severity: critical
        - name: budget_exceeded
          condition: budget_usage > 1.0
          severity: critical
    email:
      enabled: true
      smtp_server: smtp.gmail.com
      smtp_port: 587
      smtp_username: alerts@company.com
      smtp_password: ${SMTP_PASSWORD}
      from_address: litellm-alerts@company.com
      to_addresses:
        - devops@company.com
        - finance@company.com
```

33.2.4 Integrating with Claude Code

1. Configure Claude Code to Use LiteLLM

```bash
# Method 1: unified endpoint (recommended)
export ANTHROPIC_BASE_URL=https://litellm-server:4000
export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key

# Method 2: Anthropic-format endpoint
export ANTHROPIC_BASE_URL=https://litellm-server:4000/anthropic
export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key
```

Method 3: use an API key helper

```bash
# Create the helper script
cat > ~/bin/get-litellm-key.sh << 'EOF'
#!/bin/bash
# Fetch the key from Vault
vault kv get -field=api_key secret/litellm/claude-code
EOF
chmod +x ~/bin/get-litellm-key.sh
```

```bash
# Configure Claude Code to use the helper
cat > ~/.claude-code/settings.json << EOF
{
  "apiKeyHelper": "~/bin/get-litellm-key.sh",
  "env": {
    "ANTHROPIC_BASE_URL": "https://litellm-server:4000"
  }
}
EOF
```

2. Verify the Configuration

```python
import requests
from dataclasses import dataclass, field

# Minimal result/report containers (assumed shapes, filled in by the validator)
@dataclass
class ValidationResult:
    success: bool = False
    message: str = ""
    error: str = ""

@dataclass
class ValidationReport:
    connection: ValidationResult = field(default_factory=ValidationResult)
    models: dict = field(default_factory=dict)
    summary: str = ""

class LiteLLMValidator:
    """LiteLLM validator"""

    def __init__(self, gateway_url: str, auth_token: str):
        self.gateway_url = gateway_url
        self.auth_token = auth_token

    def validate_connection(self) -> ValidationResult:
        """Validate connectivity to the gateway"""
        result = ValidationResult()

        try:
            # Probe the health-check endpoint
            response = requests.get(
                f"{self.gateway_url}/health",
                headers={'Authorization': f'Bearer {self.auth_token}'},
                timeout=10
            )

            if response.status_code == 200:
                result.success = True
                result.message = "Connection successful"
            else:
                result.success = False
                result.message = f"Health check failed: {response.status_code}"

        except requests.exceptions.Timeout:
            result.success = False
            result.message = "Connection timeout"
        except requests.exceptions.ConnectionError:
            result.success = False
            result.message = "Connection error"
        except Exception as e:
            result.success = False
            result.message = f"Unexpected error: {str(e)}"

        return result

    def validate_model_access(self, model: str) -> ValidationResult:
        """Validate access to a model"""
        result = ValidationResult()

        try:
            # Issue a tiny test completion
            response = requests.post(
                f"{self.gateway_url}/v1/completions",
                headers={
                    'Authorization': f'Bearer {self.auth_token}',
                    'Content-Type': 'application/json'
                },
                json={
                    'model': model,
                    'prompt': 'Hello',
                    'max_tokens': 10
                },
                timeout=30
            )

            if response.status_code == 200:
                result.success = True
                result.message = f"Model {model} accessible"
            else:
                result.success = False
                result.message = f"Model access failed: {response.status_code}"
                result.error = response.text

        except Exception as e:
            result.success = False
            result.message = f"Model access error: {str(e)}"

        return result

    def validate_all(self) -> ValidationReport:
        """Validate the full configuration"""
        report = ValidationReport()

        # Validate connectivity
        report.connection = self.validate_connection()

        # Validate model access
        models = ['claude-sonnet-4', 'claude-opus-4', 'claude-haiku-4']
        report.models = {}

        for model in models:
            report.models[model] = self.validate_model_access(model)

        # Build the summary
        report.summary = self._generate_summary(report)

        return report

    def _generate_summary(self, report: ValidationReport) -> str:
        """Build the validation summary"""
        summary = "LiteLLM Validation Summary:\n\n"

        summary += f"Connection: {'✓' if report.connection.success else '✗'} "
        summary += f"{report.connection.message}\n\n"

        summary += "Model Access:\n"
        for model, result in report.models.items():
            status = '✓' if result.success else '✗'
            summary += f"  {status} {model}: {result.message}\n"

        return summary
```

33.2.5 Monitoring and Management

1. Prometheus Monitoring

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm-server:9090']
    metrics_path: '/metrics'
```
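Prometheus's `rate()` over a counter is just the delta between two scrapes divided by the elapsed time, and an error-rate expression divides two such rates. The sketch below reproduces that arithmetic on two hypothetical scrape samples; the field names and numbers are made up for illustration.

```python
def counter_rate(prev: float, curr: float, seconds: float) -> float:
    """Per-second rate of a monotonically increasing counter between two samples."""
    return (curr - prev) / seconds

def error_rate(prev: dict, curr: dict, seconds: float = 60.0) -> float:
    """Approximates rate(error_count) / rate(request_count) over one window."""
    req_rate = counter_rate(prev["requests"], curr["requests"], seconds)
    err_rate = counter_rate(prev["errors"], curr["errors"], seconds)
    return err_rate / req_rate if req_rate > 0 else 0.0

# Two hypothetical scrapes taken 60 s apart:
prev = {"requests": 1000, "errors": 20}
curr = {"requests": 1600, "errors": 50}
print(error_rate(prev, curr))  # 0.05 -- right at the alert threshold above
```

This is the same quantity the `high_error_rate` alert rule and the Grafana "Error Rate" panel compute server-side.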

2. Grafana Dashboard

```json
{
  "dashboard": {
    "title": "LiteLLM Dashboard",
    "panels": [
      { "title": "Request Rate",
        "targets": [{ "expr": "rate(litellm_request_count[1m])" }] },
      { "title": "Error Rate",
        "targets": [{ "expr": "rate(litellm_error_count[1m]) / rate(litellm_request_count[1m])" }] },
      { "title": "P99 Latency",
        "targets": [{ "expr": "histogram_quantile(0.99, rate(litellm_request_duration_bucket[1m]))" }] },
      { "title": "Cache Hit Rate",
        "targets": [{ "expr": "rate(litellm_cache_hits[1m]) / rate(litellm_cache_requests[1m])" }] },
      { "title": "Token Usage",
        "targets": [{ "expr": "rate(litellm_token_usage[1m])" }] },
      { "title": "Cost",
        "targets": [{ "expr": "litellm_cost_total" }] }
    ]
  }
}
```

3. Log Management

```python
class LiteLLMLogManager:
    """LiteLLM log manager"""

    def __init__(self, log_file: str):
        self.log_file = log_file
        self.log_parser = LiteLLMLogParser()

    def analyze_logs(self, start_time: datetime = None,
                     end_time: datetime = None) -> LogAnalysis:
        """Analyze the log file"""
        analysis = LogAnalysis()

        # Read the log file
        with open(self.log_file, 'r') as f:
            logs = f.readlines()

        # Parse each line
        parsed_logs = []
        for log in logs:
            try:
                parsed = self.log_parser.parse(log)
                parsed_logs.append(parsed)
            except Exception as e:
                logger.warning(f"Failed to parse log: {e}")

        # Filter by time range
        if start_time or end_time:
            parsed_logs = [
                log for log in parsed_logs
                if (not start_time or log.timestamp >= start_time)
                and (not end_time or log.timestamp <= end_time)
            ]

        # Request counts
        analysis.total_requests = len(parsed_logs)
        analysis.successful_requests = sum(
            1 for log in parsed_logs if log.status == 'success'
        )
        analysis.failed_requests = sum(
            1 for log in parsed_logs if log.status == 'error'
        )
        analysis.error_rate = (
            analysis.failed_requests / analysis.total_requests
            if analysis.total_requests > 0 else 0
        )

        # Latency
        latencies = [log.duration for log in parsed_logs if log.duration]
        if latencies:
            analysis.avg_latency = sum(latencies) / len(latencies)
            analysis.p50_latency = np.percentile(latencies, 50)
            analysis.p95_latency = np.percentile(latencies, 95)
            analysis.p99_latency = np.percentile(latencies, 99)

        # Token usage
        analysis.total_tokens = sum(
            log.input_tokens + log.output_tokens for log in parsed_logs
        )

        # Cost
        analysis.total_cost = sum(log.cost for log in parsed_logs)

        return analysis

    def generate_report(self, analysis: LogAnalysis) -> str:
        """Render an analysis report"""
        report = "LiteLLM Log Analysis Report\n"
        report += "=" * 50 + "\n\n"

        report += "Request Summary:\n"
        report += f"  Total: {analysis.total_requests}\n"
        report += f"  Successful: {analysis.successful_requests}\n"
        report += f"  Failed: {analysis.failed_requests}\n"
        report += f"  Error Rate: {analysis.error_rate:.2%}\n\n"

        report += "Latency (ms):\n"
        report += f"  Average: {analysis.avg_latency:.0f}\n"
        report += f"  P50: {analysis.p50_latency:.0f}\n"
        report += f"  P95: {analysis.p95_latency:.0f}\n"
        report += f"  P99: {analysis.p99_latency:.0f}\n\n"

        report += "Token Usage:\n"
        report += f"  Total: {analysis.total_tokens:,}\n\n"

        report += "Cost:\n"
        report += f"  Total: ${analysis.total_cost:.2f}\n"

        return report
```
